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Scheduling  of  tasks  based  on  real  time  requirement  is  a  major  issue  in  the 
heterogeneous  multicore  systems  for  micro-grid  power  management. 
Heterogeneous  multicore  processor  schedules  the  serial  tasks  in  the  high 
performance  core  and  parallel  tasks  are  executed  on  the  low  performance 
cores.  The  aim  of  this  paper  is  to  implement  a  scheduling  algorithm  based  on 
fuzzy  logic  for  heterogeneous  multicore  processor  for  effective  micro-grid 
application.  Real  -  time  tasks  generally  have  different  execution  time  and 
dead  line.  The  main  idea  is  to  use  two  fuzzy  logic  based  scheduling 
algorithm,  first  is  to  assign  priority  based  on  execution  time  and  deadline  of 
the  task.  Second  ,  the  task  which  has  assigned  higher  priority  get  allotted  for 
execution  in  high  performance  core  and  remaining  tasks  which  are  assigned 
low  priority  get  allotted  in  low  performance  cores.  The  main  objective  of  this 
scheduling  algorithm  is  to  increase  the  throughput  and  to  improve  CPU 
utilization  there  by  reducing  the  overall  power  consumption  of  the  micro -grid 
power  management  systems.  Test  cases  with  different  task  execution  time 
and  deadline  were  generated  to  evaluate  the  algorithms  using  MATLAB 
software. 
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1.  INTRODUCTION 

The  explosive  growth  of  network  bandwidth  and  Internet  high  traffic  applications,  such  as  web 
browsing,  online  searching,  video  streaming,  and  gaming,  require  orders  of  magnitude  increase  in  packet 
processing  throughput.  The  advent  of  commodity  multicore  platforms,  such  as  Caviums  OCTEON[l],  Ciscos 
AON  ,  and  IBMs  Blade  Center[2,3]  ,  has  opened  a  new  era  of  computing  for  network  applications  to  take 
advantage  of  these  low  cost  machines  due  to  their  superiority  in  performance,  availability  and 
programmability.  More  and  more  network  packet  processing  systems  have  been  developed  on  such  platforms 
ranging  from  general-purpose  processors  (e.g.,  Intel’s  Xeon  and  AMD’s  Opteron)  to  network  processors 
(e.g.,  Intel’s  IXP  platform)  and  programmable  logic  devices  (e.g., Net  FPGA)  [4,5].  To  exploit  available 
parallelism  for  better  throughput,  network  applications  running  on  multicore  platforms  usually  take  one  of 
the  following  three  forms:  Spatial  parallelism,  where  multiple  concurrent  packets  are  processed  in  different 
processors  independently.  Typical  examples  can  be  found  in  work  for  TCP  (Transmission  Control  Protocol) 
parallelism  ,  scalable  DPI  (Deep  Packet  Inspection)  design  ,  flow  level  packet  processing]  and  parallel 
multimedia  transcoding.  Temporal  parallelism  (pipelining),  where  multiple  processors  are  scheduled  into  a 
pipeline  to  overlap  periodic  executions  from  different  threads  .  It  has  been  widely  adopted  in  network 
processors,  including  Shangri-La,  auto-partitioning,  statistical  approach  and  Greedy.  Hybrid  parallelism 
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integrates  both  spatial  and  temporal  parallelism  to  benefit  from  the  advantages  of  both  sides.  It  forms  a 
parallel  pipeline  core  topology,  where  each  stage  contains  multiple  parallel  cores,  such  as  Random  and  Bipar 
[6,7]. 

Traditional  task  scheduling  schemes,  such  as  list-based  scheduling  and  clustering  based  scheduling, 
are  capable  of  reducing  program  latency  by  exploiting  fine  grained  task-level  parallelism  [8,9,10].  The 
Performance  Analysis  of  Preemptive  Based  Uniprocessor  Scheduling  was  discussed  for  the  real  time  task 
[11]  .  The  Processor  Speed  Control  for  the  Real-Time  Systems  for  power  reduction  was  analysed  [12]  .But 
these  scheduling  schemes  do  not  apply  pipelining  process  and  they  suffer  from  significant  throughput 
deterioration  when  executing  periodic  packet  processing  tasks.  Many  researchers  presented  the  results  on 
reducing  protocol  latency  for  high-speed  gateways  and  telecommunication  systems  based  on  hybrid 
parallelism.  Developing  a  packet  processing  system  that  considers  both  latency  and  throughput  for  multicore 
architectures  is  both  interesting  and  challenging  [13,14].  Thus,  we  present  a  latency  and  throughput -aware 
scheduling  scheme  based  on  parallel-pipeline  topology. 

Along  with  increased  throughput  and  reduced  latency,  however,  comes  increased  power 
consumption  for  network  applications  running  on  multicore  architecture.  Collectively,  millions  of  servers  in 
the  global  network  consume  a  great  deal  of  power.  The  chip  manufactures  continue  to  increase  both  the 
number  of  cores  and  their  frequencies,  substantially  increasing  both  dynamic  and  static  power  consumption. 
Higher  power  consumption  increases  costs,  both  directly  and  indirectly.  Energy  itself  is  expected  to  become 
more  expensive,  especially  if  environmental  impacts  are  factored  into  consumption.  Higher  power 
consumption  also  increases  core  temperature,  which  exponentially  increases  the  cost  of  cooling  and 
packaging  .  Higher  temperatures  also  increase  indirect  and  life -cycle  costs  due  to  reduced  system 
performance,  circuit  reliability,  and  chip  life-time.  Therefore,  power  management  is  a  first-order  design  issue. 
As  we  propose  the  parallel-pipeline  scheduling  on  task-level,  we  realize  that  there  has  been  no  existing  work 
considering  the  power  budget  issues  for  it.  Previous  power-aware  algorithms  either  have  not  considered 
latency,  or  have  not  explored  the  parallel  pipeline  topology  for  task  scheduling.  Since  power  gating  cannot  be 
directly  applied  to  task  scheduling,  we  resort  to  DVFS  to  integrate  power-awareness  into  parallel  pipeline 
scheduling. 


2.  METHODOLOGIES 

2.1  Latency  And  Throughput  Aware  Scheduling  (LATA) 

We  propose  LATA,  a  latency  and  Throughput- Aware  packet  processing  system  for  multicore 
architectures.  It  adopts  hybrid  parallelism  with  parallel  pipeline  core  topology  in  fine-grained  task  level  to 
achieve  low  latency  and  high  throughput.  We  accomplish  the  above  goal  through  the  following  three  steps. 
First,  we  design  a  list-based  pipeline  scheduling  algorithm  from  the  task  graph.  Second,  we  apply  a 
deterministic  search-based  refinement  process  to  reduce  latency  and  improve  throughput  through  local 
adjustment.  Third,  we  devise  a  cache-aware  resource  mapping  scheme  to  generate  a  practical  mapping  onto  a 
real  machine. 

To  the  best  of  our  knowledge,  LATA  is  the  first  of  its  kind  to  consider  both  latency  and  throughput 
in  packet  processing  systems.  We  implement  LATA  on  an  Intel  machine  with  two  Quad -Core  Xeon  E5335 
processors  and  conduct  extensive  experiments  to  show  its  better  performance  over  other  systems  such  as 
Parallel,  Greedy,  Random  and  Bipar.  Based  on  six  real  packet  processing  applications  chosen  from  Net 
Bench  and  Packet  Bench,  LATA  exhibits  an  average  of  36.5%  reduction  of  latency  across  all  applications 
without  substantially  degrading  the  throughput.  It  shows  a  maximum  of  62.2%  reduction  of  latency  for  URL 
application  over  Random  with  comparable  throughput  performance. 

2.2  Lata  System  Design 

In  the  LATA’s  system  design  ,  we  first  generate  its  corresponding  task  graph  with  both  computation 
and  communication  information.  Then,  we  proceed  in  a  three-step  procedure  to  schedule  and  map  the  task 
graph  according  to  our  novel  design.  Last,  we  deploy  the  program  onto  a  real  multicore  machine  to  obtain  its 
performance  result. 
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Figure  1.  LATA  system  design  flow  chart. 


2.3  Communication  measurement 

The  communication  time  cannot  be  accurately  measured  between  two  cores  in  a  multicore 
architecture  unless  we  know  the  exact  location  of  the  cores.  In  LATA  design,  we  use  the  average 
communication  cost  based  on  data  cache  access  time,  as  given  in  Equations  1  and  2.  Comm  avg  means  the 
average  communication  cost  to  transfer  a  unit  data  set,  which  can  be  approximated  by  system  memory 
latencies  (LI,  L2  and  main  memory  access  time)  and  program  data  cache  performances  (LI  and  L2  cache  hit 
rate).  Data  Size  refers  to  the  transferred  data  set  size  between  two  communicating  tasks. 

Commavg=  Commavg .  Data  Size  (1) 

Commavg  =  ((TLj .  Hit  Lx)  +  (  TL2  .  (1  -  Hit  Lx  )).  HitL2  +  Tmem  .  ((1  -  Hit  Lx  ).  (1  -  Hit  L2  )))  (2) 

As  we  know,  the  throughput  can  be  calculated  by  the  inverse  of  the  longest  stage  time  1/Tmax  in 
pipelining.  Thus,  we  form  our  objective  function  in  Equation  3,  where  L  is  the  scheduled  latency. 

2.4  Problem  Statement 

The  latency  can  be  defined  as  the  schedule  length  of  a  program  and  throughput  as  the  system 
throughput.  The  problem  statement  is:  given  the  latency  constraint  L0,  schedule  a  packet  processing  program 
in  parallel  pipeline  core  topology  so  as  to  maximize  the  throughput  Th.  The  aim  is  to  rearrange  the  tasks  into 
the  parallel  pipeline  task  graph  as  shown  in  Figure  3,  so  that  the  total  execution  time  T1  +  T2  +  T3  +  T4  is 
minimized  while  maintaining  the  throughput  as  high  as  possible.  As  we  know,  the  throughput  can  be 
calculated  by  the  inverse  of  the  longest  stage  time  1/Tmax  in  pipelining.  Thus,  we  form  our  objective 
function  in  Equation  3,  where  L  is  the  scheduled  latency. 

1  (3) 

Maximize  Th  =  — - (  s.  t.  at  L  <  L0) 

^max 

3  PROPOSED  LATENCY  REDUCTION 

Latency  can  be  reduced  by  reducing  either  computation  time  or  communication  time.  Because 
computation  dominates  the  overall  execution  time  for  most  packet  processing  applications  running  on 
multicore  architectures,  we  prioritize  computation  reduction  in  designing  LATA.  Hence,  LATA  first  applies 
latency  hiding  to  reduce  computation  time  . 

Then,  CCP  elimination  and  CCP  reduction  are  used  to  reduce  communication  time.  Computation 
reduction:  We  defined  a  critical  node  as  the  node  in  a  pipeline  stage  which  dominates  the  computation  time. 
Then,  Latency  hiding  can  be  defined  as  a  technique  that  places  a  critical  node  from  one  stage  to  one  of  its 
adjacent  stages  without  violating  dependencies,  so  that  its  computation  time  is  shadowed  by  the  other  critical 
node  in  the  new  stage.  Backward  hiding  (BaH)  refers  to  placing  a  critical  node  into  its  precedent  stage. 
Forward  hiding  (FoH)  refers  to  placing  a  critical  node  into  its  following  stage  which  is  shown  in  Figure  2  . 
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Figure  2.  Latency  hiding  on  node  E. 


4  PERFORMANCE  EVALUATION 

The  latency  and  throughput  for  six  applications  by  LATA,  Parallel  and  List  are  shown  in  the  Figures 
3  and  4.  We  observe  that  Parallel  suffers  from  high  latency  due  to  its  sequential  execution  of  tasks.  Compared 
with  Parallel,  LATA  reduces  the  latency  by  an  average  of  34.2%. 


Figure  3.  Latency  of  six  applications  by  LATA, 
Parallel  and  List 


Figure  4.  Throughput  of  six  applications  by  LATA, 
Parallel  and  List 


Particularly,  for  URL,  LATA  achieves  the  maximal  latency  reduction  of  62.2%.  In  addition, 
LATA’s  throughput  is  close  to  that  of  Parallel  in  spite  of  the  75%  latency  constraint.  This  is  because  LATA 
is  capable  of  optimizing  its  parallel  pipeline  core  topology  to  produce  good  throughput.  With  respect  to  List, 
which  is  designed  to  produce  the  lowest  latency,  LATA  actually  matches  its  latency  performance  in  most 
cases  by  aggressively  exploiting  task-level  parallelism.  Furthermore,  LATA  outperforms  List  in  throughput 
by  an  average  of  41.0%  and  a  maximum  of  56.7%  for  Route. 


5.  POWER  AWARE  PARALLEL  PIPELINE  SCHEDULING  ALGORITHM 

We  introduce  the  novel  parallel-pipeline  scheduling  on  task-level  for  network  applications  that  can 
attain  high  throughput  under  given  latency  constraints.  In  this  chapter,  we  address  the  power  budget  issue  for 
this  scheduling  paradigm  for  network  packet  processing.  We  aim  at  optimizing  both  throughput  and  latency 
under  given  power  budget  by  appropriately  applying  per-core  DVFS.  We  propose  a  three-step  solution  to 
achieve  our  goal. 

5.1  A  three-step  recursive  algorithm 

STEP  1:  In  the  first  step,  we  reduce  the  power  without  compromising  throughput  or  latency  by 
keeping  the  pipeline  stage  time  Ti,  i  =  1,  2...  S  unchanged.  We  define  a  critical  node  as  the  node  in  a  pipeline 
stage  that  dominates  the  computation  time.  Therefore,  the  computation  time  of  a  critical  node  is  equal  to  the 
pipeline  stage  time  (ti  =  Ti).  For  each  stage  Si,  we  increase  the  computation  time  of  non-critical  nodes  in  that 
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stage  to  the  length  of  Ti.  Since  all  stage  times  remain  the  same,  the  throughput  and  the  latency  will  also  keep 
unchanged  during  this  step  which  is  depicted  in  the  Figure  5. 


Figure  5.  Illustration  of  the  first  step  of  the  algorithm 


Figure  6.  Illustration  of  the  second  step  of  the 
algorithm 


T  i=6  T2=6+aT  T3=6  T4=6 


Figure  7.  Illustration  of  the  third  step  of  the  algorithm 


STEP  2:  In  the  second  step,  we  reduce  the  power  with  throughput  unchanged  and  minimal  latency 
increase  which  is  shown  in  Figure  6  .  This  is  achieved  by  keeping  the  longest  stage  time  Tmax  unchanged 
while  we  increase  the  stage  time  of  other  stages.  We  denote  the  stage  with  Tmax  as  the  bottleneck  stage  in 
the  pipeline.  Thus,  all  other  stages  are  non-bottleneck  stages.  We  define  AT  as  the  shortest  time  period  by 
which  we  can  increase  the  latency.  To  minimize  the  latency  increase,  we  iteratively  increase  the  latency  by 
AT  until  the  power  budget  is  satisfied  or  all  the  stages  reach  Tmax.  If  the  former  comes  true,  the  algorithm 
returns  and  the  resulting  scheduling  guarantees  the  minimal  latency  increase,  which  will  be  proved  shortly. 
Otherwise,  if  the  latter  comes  true,  we  proceed  to  step  three. 

STEP  3:  In  the  third  step,  we  reduce  the  power  by  minimizing  both  the  throughput  and  the  latency 
performance  loss.  After  step  two,  every  stage  has  the  same  stage  time  Tmax.  Following  the  same  rule  of 
choosing  a  candidate  stage  in  step  two,  we  optimally  choose  a  stage  to  further  increase  its  stage  time  by  AT. 
Since  the  original  Tmax  is  increased,  the  throughput  is  compromised  accordingly  which  is  shown  in  the 
Figure  7  . 


6.  POWER  MODEL 

Consider  that  task  T  consists  of  C  clock  cycles  on  processor  P,  which  runs  at  voltage  V  and 
frequency  f.  We  assume  that  C  does  not  change  with  different  V  and  f.  For  a  given  voltage  V  ,  processor  P 
has  an  average  power  consumption  Pow.  It  is  known  that  processor  power  consumption  is  dominated  by 
dynamic  power  dissipation  given  by: 

Pow  =  Ka  .  f.  V2  (4) 
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Where  Ka  is  a  task/processor  dependent  factor  determined  by  the  switched  capacitance. 
The  energy  consumed  by  executing  task  T  on  processor  P  is  computed  as: 


E  =  C  .  Pow  f 

(5) 

We  can  rewrite  it  as: 

E  =  C.Ef 

(6) 

V=C.Ka.V2 

(7) 

Where  Ef,  V  is  the  average  cycle  energy. 

From  this  we  can  see  that  lowering  the  voltage  would  yield  a  drastic  decrease  in  energy 
consumption.  The  frequency  f  is  almost  linearly  related  to  the  voltage: 

f  =  Kb  .(V  -  VX)(V  —  VT)2  .V  (8) 

Where  VT  is  the  threshold  voltage  and  Kb  is  a  constant.  For  a  sufficiently  small  threshold  voltage, 
the  frequency  is  approximated  to  Kb. 

However,  our  algorithm  is  able  to  guarantee  a  minimal  performance  loss  in  this  scenario.  The  proof 
of  optimality  is  in  line  with  that  in  step  two,  where  the  minimal  time  period  increment  guarantees  that  when 
we  satisfy  the  power  budget  constraint,  the  performance  loss  is  minimal  which  is  shown  in  the  Figure  8. 


Figure  8.  The  power-aware  parallel-pipeline  scheduling  algorithm. 


In  our  system,  since  energy  increases  when  performance  degrades  (i.e.,  a  longer  execution  time),  we 
assign  a  higher  weight  to  performance  by  setting  ac  =  0.15  am.  Specifically,  the  lost  factors  for  GPU  cores 
and  memory  are  calculated  as: 

LCf  =  ac.Cfe  +(1-  ac)(l_C?p)  (9) 

Then  we  combined  core  and  memory  lost  functions  together  by  a  factor,  which  balances  core  impact 
and  memory  impact  in  influencing  system  performance  and  energy. 

Total  Loss  [j  =  0.1  C[  +  (1  —  0).l  mf  (10) 

Shows  how  total  lost  function  is  obtained.  For  different  CPU-GPU  systems,  by  tuning  cp  value  the 
system  can  achieve  balance  between  core  and  memory  influence.  In  our  hardware  tested,  0.33  is  the  value 
reflects  system  characteristic  derived  from  experiments.  Based  on  the  total  loss,  the  weights  used  in  the 
frequency  scaling  algorithm  can  be  updated  as  follows. 

weight  =  ((weight^  )(  1  -  (  1  -  (3).  Total  Loss  -  )  (11) 
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The  algorithm  will  be  more  robust  to  system  noise.  A  smaller  p  gives  more  weight  on  loss  factor  of 
current  time  interval.  The  algorithm  will  respond  to  workload  change  in  a  short  time.  In  our  experiment,  we 
select  p  =  0.2  to  filter  out  system  noise  with  quick  workload  change  response.  Almond  the  entire  N  x  M 
weights  (assume  we  have  N  core  frequency  levels  and  M  memory  frequency  levels),  the  highest  one  is 
selected  and  its  corresponding  core  and  memory  frequencies  are  enforced  in  the  next  period. 


6.  EXPERIMENTAL  RESULTS 

The  power  and  thermal-aware  scheduling  results  for  different  benchmarks  from  the  embedded 
system  synthesis  benchmarks  suite  or  generated  using  the  MATLAB  tool.  The  process  of  Task  scheduling  is 
shown  in  the  Figure  9.  The  ILP  solutions  were  generated  using  OPENMP.  The  simulations  were  performed 
on  a  dual  core  system  of  3  GHZ  processor.  The  output  execution  and  the  results  using  the  fuzzy  logic 
techniques  are  shown  in  the  Figures  10,llandl2.The  execution  time  and  the  relative  deadline  are  adjusted  to 
get  the  desired  priority  of  the  tasks.  The  modified  priority  order  based  on  the  fuzzy  logic  output  is  shown  in 
the  Figure  13.  The  corresponding  waveforms  are  depicted  in  the  Figure  14.  As  the  power  information  of  the 
processing  elements  in  these  benchmarks  is  not  available,  approximation  values  based  on  the  internal 
structure  of  each  core  is  used.  Based  on  these  approximated  values,  the  power  consumption  of  each  core  is 
adjusted  by  replacing  by  an  estimated  number  of  gates  in  the  module.  The  hardware  prototype  model  for  the 
Real  -  Time  task  scheduling  in  Heterogeneous  Multicore  Processor  for  Microgrid  Power  Management  is 
shown  in  the  Figurel5. 


Tasks 


Figure  9. Task  scheduling  process 


RSVanatfes 


yinwsfcta^ 


Figure  10.  Executing  output  according  to  Rules 


Figure  11.  Input  Waveform 
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Figure  12  .  Output  Waveforms 


Figure  13.  Adjusting  Execution  and  Deadline  for 
Getting  Priority 


Figure  14.  Output  Waveform  Figure  15.  Prototype  Hardware  model 


7.  CONCLUSION 

Current  research  on  GPU-CPU  systems  focuses  mainly  on  the  performance  aspects,  while  the 
energy  efficiency  of  such  systems  receives  much  less  attention.  There  are  few  existing  studies  that  start  to 
lower  the  energy  consumption  of  GPU-CPU  architectures,  but  they  address  either  GPU  or  CPU  in  an  isolated 
manner  and  thus  cannot  achieve  maximized  energy  savings.  In  this  paper,  we  have  presented  Green  GPU,  a 
holistic  energy  management  framework  for  GPU-CPU  heterogeneous  architectures.  Our  solution  features  a 
two-tier  design.  In  the  first  tier,  Green  GPU  dynamically  splits  and  distributes  workloads  to  GPU  and  CPU 
based  on  the  workload  characteristics,  such  that  both  sides  can  finish  approximately  at  the  same  time.  As  a 
result,  the  energy  wasted  on  staying  idle  and  waiting  for  the  slower  side  to  finish  is  minimized.  In  the  second 
tier,  Green  GPU  dynamically  throttles  the  frequencies  of  GPU  cores  and  memory  in  a  coordinated  manner, 
based  on  their  utilizations,  for  maximized  energy  savings  with  only  marginal  performance  degradation. 
Likewise,  the  frequency  and  voltage  of  the  CPU  are  scaled  similarly.  We  implement  Green  GPU  using  the 
CUD  A  framework  on  a  real  physical  tested  with  Nvidia  GeForce  GPUs  and  AMD  Phenom  II  CPUs. 
Experiment  results  with  standard  Rodinia  benchmarks  show  that  Green  GPU  achieves  21.04%  average 
energy  savings  and  outperform  several  well-designed  baselines. 
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